We present a decomposition-based approach to managing incomplete
information. We introduce world-set decompositions (WSDs), a
space-efficient and complete representation system for finite sets of
worlds. We study the problem of efficiently evaluating relational
algebra queries on world-sets represented by WSDs. We also evaluate our
technique experimentally in a large census data scenario and show that
it is both scalable and efficient.



1 Introduction 

Incomplete information is commonplace in real-world 
databases. Classical examples can be found in data inte- 
gration and wrapping applications, linguistic collections, or 
whenever information is manually entered and is therefore 
prone to inaccuracy or partiality. 

There has been little research so far into expressive yet 
scalable systems for representing incomplete information. 
Current techniques can be classified into two groups. The 
first group includes representation systems such as v-tables 
[15] and or-set relations [16], which are not strong enough
to represent the results of relational algebra queries within 
the same formalism. In v-tables the tuples can contain both 
constants and variables, and each combination of possible 
values for the variables yields a possible world. Relations 
with or-sets can be viewed as v-tables, where each variable 
occurs only at a single position in the table and can only 
take values from a fixed finite set, the or-set of the field occupied
by the variable. The so-called c-tables [15] belong to
the second group of formalisms. They extend v-tables with 
conditions specified by logical formulas over the variables, 
thus constraining the possible values. Although c-tables are 
a strong representation system, they have not found appli- 
cation in practice. The main reason for this is probably that 
managing c-tables directly is rather inefficient. Even very 
basic problems such as deciding whether a tuple is in at least 
one world represented by the c-table are intractable [3].



As a motivation, consider two manually completed 
forms that may originate from a census and which allow 
for more than one interpretation (Figure 1). For simplicity
we assume that social security numbers consist of only
three digits. For instance, Smith's social security number 
can be read either as "185" or as "785". We can represent 
the available information using a relation with or-sets: 
It is easy to see that this or-set relation represents
$2 \cdot 2 \cdot 2 \cdot 4 = 32$ possible worlds.

Given such an incompletely specified database, it must 
of course be possible to access and process the data. Two 
data management tasks shall be pointed out as particularly 
important, the evaluation of queries on the data and data 
cleaning [17, 13, 18], by which certain worlds can be shown
to be impossible and can be excluded. The results of both 
types of operation turn out not to be representable by or- 
set relations in general. Consider for example the integrity 
constraint that all social security numbers be unique. For 
our example database, this constraint excludes 8 of the 32 
worlds, namely those in which both tuples have the value 
185 as social security number. It is impossible to repre- 
sent the remaining 24 worlds using or-set relations. This is 
an example of a constraint that can be used for data clean- 
ing; similar problems are observed with queries, e.g., the 
query asking for pairs of persons with differing social secu- 
rity numbers. 
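The counting argument above can be checked by brute force. The following is a minimal sketch, not part of the original system; the or-sets for the marital status fields (codes 1 to 4, as on the survey form) are an assumption read off the surrounding text, and the names `or_set_relation` and `worlds` are hypothetical.

```python
from itertools import product

# Or-set relation of the running example: every field lists its possible values.
# The marital-status or-sets are an assumption based on the surrounding text.
or_set_relation = [
    {"S": [185, 785], "N": ["Smith"], "M": [1, 2]},
    {"S": [185, 186], "N": ["Brown"], "M": [1, 2, 3, 4]},
]

def worlds(relation):
    """Yield every world: one concrete value chosen for each or-set field."""
    per_tuple = [
        [dict(zip(t.keys(), combo)) for combo in product(*t.values())]
        for t in relation
    ]
    for choice in product(*per_tuple):
        yield list(choice)

all_worlds = list(worlds(or_set_relation))
# Integrity constraint: social security numbers must be unique within a world.
unique_ssn = [w for w in all_worlds if len({t["S"] for t in w}) == len(w)]
print(len(all_worlds), len(unique_ssn))  # 32 24
```

The 24 surviving worlds no longer form a product of independent per-field choices, which is why they cannot be written as an or-set relation.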

What we could do is store each world explicitly using 
a table called a world-set relation of a given set of worlds. 
Each tuple in this table represents one world and is the concatenation
of all tuples in that world (see Figure 2).

The most striking problem of world-set relations is their size. If we
conduct a survey of 50 questions on a population of 200 million and we
assume that one in $10^4$ answers can be read in just two different
ways, then $10^6$ of the $50 \cdot 2 \cdot 10^8 = 10^{10}$ answers are
ambiguous and we get $2^{10^6}$ worlds. Each such world is a substantial
table of 50 columns and $2 \cdot 10^8$ rows. We cannot store all these
worlds explicitly in a world-set relation (which would have $10^{10}$
columns and $2^{10^6}$ rows). Data cleaning will often eliminate only
some of these worlds, so a DBMS should manage those that remain.

Figure 1. Two completed survey forms.

Figure 2. World-set relation containing only worlds with unique social
security numbers.

This article aims at dealing with this complexity and pro- 
poses the new notion of world-set decompositions (WSDs). 
These are decompositions of a world-set relation into sev- 
eral relations such that their product (using the product op- 
eration of relational algebra) is again the world-set relation. 

Example 1.1 The world-set represented by our initial or-set relation
can also be represented by the product

    t1.S     t1.N      t1.M     t2.S     t2.N      t2.M
    ----     -----     ----     ----     -----     ----
    185   ×  Smith  ×  1     ×  185   ×  Brown  ×  1
    785                2        186                2
                                                   3
                                                   4


Example 1.2 In the same way we can represent the result of data
cleaning with the uniqueness constraint for the social security numbers
as the product of Figure 3.

One can observe that the result of this product is exactly the
world-set relation in Figure 2. The presented decomposition is based on
the independence between sets of fields, subsequently called
components. Only fields that depend on each other, for example $t_1.S$
and $t_2.S$, belong to the same component. Since $\{t_1.S, t_2.S\}$ and
$\{t_1.M\}$ are independent, they are put into different components. □

Figure 3. WSD of the relation in Figure 2.

Often, one can quantify the certainty of a combination of 
possible values using probabilities. For example, an auto- 
matic extraction tool that extracts structured data from text 
can produce a ranked list of possible extractions, each associated
with a probability of being the correct one [14].

WSDs can elegantly handle such scenarios by simply 
adding a new column Pr to each component relation, which 
contains the probability for the corresponding combination 
of values. 
Figure 4. Probabilistic version of the WSD of Figure 3.



Example 1.3 Figure 4 shows a probabilistic version of the WSD of
Figure 3. The probabilities in the last component imply that the
possible values for the marital status of $t_2$ are equally likely,
whereas $t_1$ is more likely to be single than married. The
probabilities for the name values for $t_1$ and $t_2$ equal one, as
this information is certain. □

Given a probabilistic WSD $\{C_1, \ldots, C_m\}$, we obtain a possible
world by choosing one tuple $w_i$ out of each component relation $C_i$.
The probability of this world is then computed as $\prod_i w_i.Pr$. For
example, in Figure 4, choosing the first, the second and the third
tuple from the first, the third and the fifth component, respectively,
results in a world whose probability can be computed as
$0.2 \cdot 0.3 \cdot 0.25 = 0.015$.
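The rule "pick one tuple per component and multiply the Pr values" can be sketched as follows. This is only an illustration: the component layout and the probabilities 0.4 and 0.7 are assumptions, chosen so that the choice described above reproduces the factors 0.2, 0.3 and 0.25; `pick_world` is a hypothetical helper.

```python
from math import prod

# Hypothetical probabilistic WSD: each component is a list of (values, Pr) pairs.
# Only 0.2, 0.3 and 0.25 come from the text; the other weights are assumed.
components = [
    [({"t1.S": 185, "t2.S": 186}, 0.2),
     ({"t1.S": 785, "t2.S": 185}, 0.4),
     ({"t1.S": 785, "t2.S": 186}, 0.4)],
    [({"t1.N": "Smith"}, 1.0)],
    [({"t1.M": 1}, 0.7), ({"t1.M": 2}, 0.3)],
    [({"t2.N": "Brown"}, 1.0)],
    [({"t2.M": m}, 0.25) for m in (1, 2, 3, 4)],
]

def pick_world(components, choice):
    """Combine one local world per component; choice[i] indexes component i."""
    world, probs = {}, []
    for comp, idx in zip(components, choice):
        values, pr = comp[idx]
        world.update(values)
        probs.append(pr)
    return world, prod(probs)

# First tuple of component 1, second of component 3, third of component 5.
world, p = pick_world(components, (0, 0, 1, 0, 2))
print(world, round(p, 6))
```

Because the components are independent, the probabilities of all worlds obtained this way sum to one whenever each component's Pr column does.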



In practice, it is often the case that fields or even tu- 
ples carry the same values in all worlds. For instance, in 
the census data scenario discussed above, we assumed that 
only one field in 10000 has several possible values. Such a 
world-set decomposes into a WSD in which most fields are 
in component relations that have precisely one tuple. 

We will also consider a refinement of WSDs, WSDTs, 
which store information that is the same in all possible 
worlds once and for all in so-called template relations. 

Example 1.4 The world-set of the previous examples can be represented
by the WSDT of Figure 5. □

Figure 5. Probabilistic WSD with a template relation.



WSDTs combine the advantages of WSDs and c-tables. 
In fact, WSDTs can be naturally viewed as c-tables whose 
formulas have been put into a normal form represented by 
the component relations, and null values '?' in the template 
relations represent fields on which the worlds disagree. In- 
deed, each tuple in the product of the component relations 
is a possible value assignment for the variables in the template
relation. The following c-table with global condition $\Phi$ is
equivalent to the WSDT in Figure 5 (modulo the probabilistic weights):

    R     S     N       M
    t1    x     Smith   y
    t2    z     Brown   w

$$\Phi = ((x = 185 \wedge z = 186) \vee (x = 785 \wedge z = 185) \vee (x = 785 \wedge z = 186)) \wedge (y = 1 \vee y = 2) \wedge (w = 1 \vee w = 2 \vee w = 3 \vee w = 4)$$

The technical contributions of this article are as follows. 

• We formally introduce WSDs and WSDTs and study 
some of their properties. Our notion is a refinement 
of the one presented above and allows representing
worlds over multi-relation schemas which contain re- 
lations with varying numbers of tuples. WSD(T)s can 
represent any finite set of possible worlds over rela- 
tional databases and are therefore a strong representa- 
tion system for any relational query language. 



• A practical problem with WSDs and WSDTs is that a 
DBMS that manages such representations has to sup- 
port relations of arbitrary arity: the schemata of the 
component relations of a decomposition depend on the 
data. Unfortunately, DBMS (e.g. PostgreSQL) in prac- 
tice often do not support relations beyond a fixed arity. 

For that reason we present refinements of the notion 
of WSDs, the uniform WSDs (UWSDs), and their ex- 
tension by template relations, the UWSDTs, and study 
their properties as representation systems. 

• We show how to process relational algebra queries 
over world-sets represented by UWSDTs. For illus- 
tration purposes, we discuss query evaluation in the 
context of the much more graphic WSDs. 

We also develop a number of optimizations and tech- 
niques for normalizing the data representations ob- 
tained by queries to support scalable query processing 
even on very large world-sets. 

• We describe a prototype implementation built on top 
of the PostgreSQL RDBMS. Our system is called 
MayBMS and supports the management of incomplete 
information using UWSDTs. 

• We report on our experimental evaluation of UWSDTs 
as a representation system for large finite sets of pos- 
sible worlds. Our experiments show that UWSDTs al- 
low highly scalable techniques for managing incom- 
plete information. We found that the size of UWSDTs 
obtained as query answers or data cleaning results re- 
mains close to that of a single world. Furthermore, the 
processing time for queries on UWSDTs is also com- 
parable to processing just a single world and thus a 
classical relational database. 

• For our experiments, we develop data cleaning techniques
in the context of UWSDTs. To clean data of inconsistent
worlds we chase a set of equality-generating
dependencies on UWSDTs, which we briefly describe.

WSDs are designed to cope with large sets of worlds, 
which exhibit local dependencies and large commonalities. 
Note that this data pattern can be found in many applications.
Besides the census scenario, Section 9 describes two
further applications: managing inconsistent databases using
minimal repairs [7, 9] and medicine data.

A fundamental assumption of this work is that one wants 
to manage finite sets of possible worlds. This is justified 
by previous work on representation systems starting with 
Imielinski and Lipski [15], by recent work [12, 4, 8], and
by current application requirements. Our approach can 
deal with databases with unresolved uncertainties. Such 
databases are still valuable. It should be possible to do 



data transformations that preserve as much information as 
possible, thus necessarily mapping between sets of possi- 
ble worlds. In this sense, WSDs represent a compositional 
framework for querying and data cleaning. A different approach
is followed in, e.g., [17, 10], where the focus is on
finding certain answers of queries on incomplete and in- 
consistent databases. 

Related Work. The probabilistic databases of [12, 11] and
the dirty relations of [4] are examples of practical representation
systems that are not strong for relational algebra. As
query answers in general cannot be represented as a set of 
possible worlds in the same formalism, query evaluation is 
focused on computing the certain answers to a query, or the 
probability of a tuple being in the result. Such formalisms 
close the possible worlds semantics using clean answers [4]
and probabilistic-ranked retrieval [12]. As we will see in
this article, our approach subsumes the aforementioned two 
and is strictly more expressive than them. 

In parallel to our approach, [21] propose ULDBs
that combine uncertainty and a low-level form of lineage
to model any finite world-set. Like the dirty relations of
[4], ULDBs represent a set of independent tuples with
alternatives. Lineage is then used to represent dependencies
among alternatives of different tuples and thus is essential 
for the expressive power of the formalism. 

As both ULDBs and WSDs can model any finite world-set, they
inherently share some similarities, yet differ in important aspects.
WSDs support efficient algorithms for finding a minimal data
representation based on relational factorization. Differently from
ULDBs, WSDs allow representing uncertainty at the level of tuple
fields, not only of tuples. This causes, for instance, or-set
relations to have linear representations as WSDs, but (in general)
exponential representations as ULDBs. As reported in
0, resolving tuple dependencies, i.e., tracking which alter- 
natives of different tuples belong to the same world, often 
requires to compute expensive lineage closure. Addition- 
ally, query operations on ULDBs can produce inconsisten- 
cies and anomalies, such as erroneous dependencies and in- 
existent tuples. In contrast, WSDs share neither of these 
pitfalls. As no implementation of ULDBs was available at 
the time of writing this document, no experimental compar- 
ison of ULDBs and WSDs could be established. 

2 Preliminaries 

We use the named perspective of the relational model with the
operations selection $\sigma$, projection $\pi$, product $\times$,
union $\cup$, difference $-$, and attribute renaming $\delta$
(cf. e.g. 0). A relational schema is a tuple
$\Sigma = (R_1[U_1], \ldots, R_k[U_k])$, where each $R_i$ is a relation
name and $U_i$ is a set of attribute names. Let $D$ be a set of domain
elements. A relation over schema $R[A_1, \ldots, A_k]$ is a set of
tuples $(A_1 : a_1, \ldots, A_k : a_k)$ where $a_1, \ldots, a_k \in D$.
A relational database $\mathcal{A}$ over schema $\Sigma$ is a set of
relations $R^{\mathcal{A}}$, one for each relation schema $R[U]$ from
$\Sigma$. Sometimes, when no confusion of database may occur, we will
use $R$ rather than $R^{\mathcal{A}}$ to denote one particular relation
over schema $R[U]$. By the size of a relation $R$, denoted $|R|$, we
refer to the number of tuples in $R$. For a relation $R$ over schema
$R[U]$, let $S(R)$ denote the set $U$ of its attributes and let $ar(R)$
denote the arity of $R$.

A product $m$-decomposition of a relation $R$ is a set of non-nullary
relations $\{C_1, \ldots, C_m\}$ such that
$C_1 \times \cdots \times C_m = R$. The relations $C_1, \ldots, C_m$
are called components. A product $m$-decomposition of $R$ is maximal if
there is no product $n$-decomposition of $R$ with $n > m$.
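To make the definition concrete, the following minimal sketch checks whether a given set of components is a product decomposition of a relation. The representation (a pair of attribute names and a set of value tuples) and the function names are assumptions made for this illustration only.

```python
from itertools import product

def product_of(components):
    """Relational product of components given as (attrs, set_of_tuples)."""
    attrs = tuple(a for comp_attrs, _ in components for a in comp_attrs)
    rows = {
        tuple(v for part in combo for v in part)
        for combo in product(*(comp_rows for _, comp_rows in components))
    }
    return attrs, rows

def is_product_decomposition(relation, components):
    """True iff the components are non-nullary and their product equals relation."""
    rel_attrs, rel_rows = relation
    if any(len(comp_attrs) == 0 for comp_attrs, _ in components):
        return False
    attrs, rows = product_of(components)
    if sorted(attrs) != sorted(rel_attrs):
        return False
    # Reorder product columns to the attribute order of the relation.
    order = [attrs.index(a) for a in rel_attrs]
    return {tuple(r[i] for i in order) for r in rows} == set(rel_rows)

R = (("A", "B", "C"),
     {(1, "x", "p"), (1, "x", "q"), (2, "y", "p"), (2, "y", "q")})
good = [(("A", "B"), {(1, "x"), (2, "y")}), (("C",), {("p",), ("q",)})]
bad = [(("A",), {(1,), (2,)}), (("B",), {("x",), ("y",)}), (("C",), {("p",), ("q",)})]
print(is_product_decomposition(R, good), is_product_decomposition(R, bad))
```

In this example A and B depend on each other, so they must share a component, while C is independent of both.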

A set of possible worlds (or world-set) over schema $\Sigma$ is a set
of databases over schema $\Sigma$. Let $\mathbf{W}$ be a set of
structures, and let $rep$ be a function that maps from $\mathbf{W}$ to
world-sets of the same schema. Then $(\mathbf{W}, rep)$ is a strong
representation system for a query language if, for each query $Q$ of
that language and each $\mathcal{W} \in \mathbf{W}$ such that $Q$ is
applicable to the worlds in $rep(\mathcal{W})$, there is a structure
$\mathcal{W}' \in \mathbf{W}$ such that
$rep(\mathcal{W}') = \{Q(\mathcal{A}) \mid \mathcal{A} \in rep(\mathcal{W})\}$.
Obviously,

Lemma 2.1 If rep is a function from a set of structures W 
to the set of all finite world-sets, then (W, rep) is a strong 
representation system for any relational query language. 

3 Probabilistic World- Set Decompositions 

In order to use classical database techniques for storing 
and querying incomplete data, we develop a scheme for rep- 
resenting a world-set A by a single relational database. 

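Let $\mathbf{A}$ be a finite world-set over schema
$\Sigma = (R_1, \ldots, R_k)$. For each $R$ in $\Sigma$, let
$|R|_{\max} = \max\{|R^{\mathcal{A}}| : \mathcal{A} \in \mathbf{A}\}$
denote the maximum cardinality of relation $R$ in any world of
$\mathbf{A}$. Given a world $\mathcal{A}$ with
$R^{\mathcal{A}} = \{t_1, \ldots, t_{|R^{\mathcal{A}}|}\}$, let
$t_{R^{\mathcal{A}}}$ be the tuple obtained as the concatenation
(denoted $\circ$) of the tuples of $R^{\mathcal{A}}$ in an arbitrary
order, padded with copies of a special tuple
$t_\bot = (\underbrace{\bot, \ldots, \bot}_{ar(R)})$ up to $|R|_{\max}$
tuples, that is, up to arity $ar(R) \cdot |R|_{\max}$:

$$t_{R^{\mathcal{A}}} := t_1 \circ \cdots \circ t_{|R^{\mathcal{A}}|} \circ (\underbrace{t_\bot, \ldots, t_\bot}_{|R|_{\max} - |R^{\mathcal{A}}|})$$

Then tuple
$t_{\mathcal{A}} := t_{R_1^{\mathcal{A}}} \circ \cdots \circ t_{R_k^{\mathcal{A}}}$
encodes all the information in world $\mathcal{A}$. The "dummy" tuples
with $\bot$-values are only used to ensure that the relation $R$ has
the same number of tuples in all worlds in $\mathbf{A}$. We extend this
interpretation and generally define as $t_\bot$ any tuple that has at
least one symbol $\bot$, i.e., $(A_1 : a_1, \ldots, A_n : a_n)$, where
at least one $a_i$ is $\bot$, is a $t_\bot$ tuple. This allows for
several different inlinings of the same world-set.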



By a world-set relation of a world-set $\mathbf{A}$, we denote the
relation $\{t_{\mathcal{A}} \mid \mathcal{A} \in \mathbf{A}\}$. This
world-set relation has schema
$\{R.t_i.A_j \mid R[U] \in \Sigma,\ 1 \le i \le |R|_{\max},\ A_j \in U\}$.
Note that in defining this schema we use $t_i$ to denote the position
(or identifier) of tuple $t_i$ in $t_{R^{\mathcal{A}}}$ and not its
value.

Given the above definition that turned every world into a tuple of a
world-set relation, computing the initial world-set is an easy
exercise. In order to have every world-set relation define a world-set,
let a tuple extracted from some $t_{R^{\mathcal{A}}}$ be in
$R^{\mathcal{A}}$ iff it does not contain any occurrence of the special
symbol $\bot$. That is, we map
$t_{R^{\mathcal{A}}} = (a_1, \ldots, a_{ar(R) \cdot |R|_{\max}})$ to
$R^{\mathcal{A}}$ as

$$t_{R^{\mathcal{A}}} \mapsto \{(a_{ar(R) \cdot k + 1}, \ldots, a_{ar(R) \cdot (k+1)}) \mid 0 \le k < |R|_{\max},\ a_{ar(R) \cdot k + 1} \ne \bot, \ldots, a_{ar(R) \cdot (k+1)} \ne \bot\}.$$

Observe that although world-set relations are not unique 
as we have left open the ordering in which the tuples of 
a given world are concatenated, all world-set relations of 
a world-set A are equally good for our purposes because 
they can be mapped invariantly back to A. Note that for 
each world-set relation a maximal decomposition exists, is 
unique, and can be efficiently computed [6]. 
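The inlining of a world into one wide tuple and the mapping back can be sketched for a single relation as follows. This is a minimal illustration under assumptions: `None` stands in for the padding symbol ⊥, tuples are plain Python tuples, and `inline` and `decode` are hypothetical names.

```python
BOTTOM = None  # stands in for the padding symbol ⊥

def inline(world_relation, arity, max_card):
    """Concatenate the tuples of one relation and pad with ⊥-tuples."""
    flat = [v for t in world_relation for v in t]
    flat += [BOTTOM] * (arity * (max_card - len(world_relation)))
    return tuple(flat)

def decode(flat, arity, max_card):
    """Recover the relation: keep every arity-wide slice that contains no ⊥."""
    slices = (flat[arity * k: arity * (k + 1)] for k in range(max_card))
    return {s for s in slices if BOTTOM not in s}

# Two worlds over one binary relation; the second world has fewer tuples.
worlds = [
    [(185, "Smith"), (186, "Brown")],
    [(785, "Smith")],
]
max_card = max(len(w) for w in worlds)
world_set_relation = {inline(w, 2, max_card) for w in worlds}
print(sorted(world_set_relation, key=str))
```

Decoding ignores the order in which tuples were concatenated, which mirrors the observation that all world-set relations of a world-set map back to the same set of worlds.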

Definition 3.1 Let A be a world-set and W a world-set re- 
lation representing A. Then a world-set m-decomposition 
(m-WSD) of A is a product m-decomposition of W. 

We will refer to each of the m elements of a world-set m- 
decomposition as components, and to the component tuples 
as local worlds. Somewhat simplified examples of world- 
set relations and WSDs over a single relation R (thus "R" 
was omitted from the attribute names of the world-set relations)
were given in Section 1. Further examples can be found in Section 4.
It should be emphasized that with WSDs
we can also represent multiple relational schemata and even 
components with fields from different relations. 

It immediately follows from our definitions that 

Proposition 3.2 Any finite set of possible worlds can be
represented as a world-set relation and as a 1-WSD.

Corollary 3.3 (Lemma 2.1) WSDs are a strong representation
system for any relational query language.

As pointed out in Section 1, this is not true for or-set
relations. For the relatively small class of world-sets that 
can be represented as or-set relations, the size of our repre- 
sentation system is linear in the size of the or-set relations. 
As seen in the examples, our representation is much more 
space-efficient than world-set relations. 

Modeling Probabilistic Information. We can quantify the uncertainty of
the data by means of probabilities using a natural extension of WSDs. A
probabilistic world-set $m$-decomposition (probabilistic $m$-WSD) is an
$m$-WSD $\{C_1, \ldots, C_m\}$, where each component relation $C$ has a
special attribute $Pr$ in its schema defining the probability for the
local worlds, that is, for each combination of values defined by the
component. We require that the probabilities in a component sum up to
one, i.e. $\sum_{t_C \in C} t_C.Pr = 1$.

Probabilistic WSDs generalize the probabilistic tuple-independent
model of [12], as we show next. Figure 6(a) is an example taken from
[12]. It shows a probabilistic database with two relations $S$ and $T$.
Each tuple is assigned a confidence value, which represents the
probability of the tuple being in the database, and the tuples are
assumed independent. A possible world is obtained by choosing a subset
of the tuples in the probabilistic database, and its probability is
computed by multiplying the probabilities for selecting a tuple or not,
depending on whether that tuple is in the world. The set of possible
worlds for $D$ is given in Figure 6(b). For example, the probability of
the world $D_3$ can be computed as
$(1 - 0.8) \cdot 0.5 \cdot 0.6 = 0.06$.

Figure 6. A probabilistic database for relations S and T (a), and the
represented set of possible worlds (b).



We obtain a probabilistic WSD in the following way. 
Each tuple t with confidence c in a probabilistic database 
induces a WSD component representing two local worlds: 
the local world with tuple t and probability c, and the empty 
world with probability $1 - c$. Figure 7 gives the WSD encoding
of the probabilistic database of Figure 6. Of course,
in probabilistic WSDs we can assign probabilities not only 
to individual tuples, but also to combinations of values for 
fields of different tuples or relations. 
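The encoding of a tuple-independent database as a probabilistic WSD can be sketched as below. The tuples and confidences are assumptions picked for illustration (they are chosen so that one world has probability 0.06, as in the computation above), and `to_wsd` and `enumerate_worlds` are hypothetical helpers.

```python
from itertools import product
from math import prod

# Hypothetical tuple-independent database: (relation, tuple) -> confidence.
prob_db = {
    ("S", ("m", 1)): 0.8,
    ("S", ("n", 1)): 0.5,
    ("T", (1, "p")): 0.6,
}

def to_wsd(prob_db):
    """One component per tuple: the tuple with probability c, or the
    empty local world (None) with probability 1 - c."""
    return [[(key, c), (None, 1.0 - c)] for key, c in prob_db.items()]

def enumerate_worlds(wsd):
    """Yield (set of present tuples, probability) for every possible world."""
    for choice in product(*wsd):
        present = frozenset(key for key, _ in choice if key is not None)
        yield present, prod(p for _, p in choice)

worlds = dict(enumerate_worlds(to_wsd(prob_db)))
d3 = frozenset({("S", ("n", 1)), ("T", (1, "p"))})
print(len(worlds), round(worlds[d3], 6))
```

Three independent tuples give three two-row components and hence eight worlds, whose probabilities sum to one.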
Figure 7. WSD equivalent to the probabilistic database in Figure 6(a).



Adding Template Relations. We now present our refinement of WSDs with
so-called template relations. A template stores information that is
the same in all possible worlds and contains special values
'?' $\notin D$ in fields at which different worlds disagree.

Let $\Sigma = (R_1, \ldots, R_k)$ be a schema and $\mathbf{A}$ a finite
set of possible worlds over $\Sigma$. Then, the database
$(R_1^0, \ldots, R_k^0, \{C_1, \ldots, C_m\})$ is called an $m$-WSD
with template relations ($m$-WSDT) of $\mathbf{A}$ iff there is a WSD
$\{C_1, \ldots, C_m, D_1, \ldots, D_n\}$ of $\mathbf{A}$ such that
$|D_i| = 1$ for all $i$ and if relation $D_i$ has attribute $R_j.t.A$
and value $v$ in its unique $R_j.t.A$-field, then the template relation
$R_j^0$ has a tuple with identifier $t$ whose $A$-field has value $v$.

Of course WSDTs again can represent any finite world-set and are thus
a strong representation system for any relational query language.
Example 1.4 shows a WSDT for the running example of the introduction.

Uniform World-Set Decompositions. In practice database
systems often do not support relations of arbitrary arity 
(e.g., WSD components). For that reason we introduce next 
a modified representation of WSDs called uniform WSDs. 
Instead of having a variable number of component relations, 
possibly with different arities, we store all values in a single 
relation C that has a fixed schema. We use the fixed schema 
consisting of the three relation schemata 

C[FID, LWID, VAL],F[FID, CID],W[CID, LWID, PR] 

where FID is a triple¹ (Rel, TupleID, Attr) denoting the
Attr-field of tuple TupleID in database relation Rel.

In this representation we need a restricted flavor of 
world-ids called local world-ids (LWIDs). The local world- 
ids refer only to the possible worlds within one component. 
LWIDs avoid the drawbacks of "global" world IDs for the 
individual worlds. This is important, since the size of global 
world IDs can exceed the size of the decomposition itself, 
thus making it difficult or even impossible to represent the 
world-sets in a space-efficient way. If any world-set over a 
given schema and a fixed active domain is permitted, one 
can verify that global world-ids cannot be smaller than the 
largest possible world over the schema and the active do- 
main. 

Given a WSD $\{C_1, \ldots, C_m\}$ with schemata $C_i[U_i]$, we
populate the corresponding UWSD as follows.

• $((R, t, A), s, v) \in C$ iff, for some (unique) $i$,
$R.t.A \in U_i$ and the field of column $R.t.A$ in the tuple with id
$s$ of $C_i$ has value $v$.

• $F := \{((R, t, A), C_i) \mid 1 \le i \le m,\ R.t.A \in U_i\}$,

• $(C_i, s, p) \in W$ iff there is a tuple with identifier $s$ in
$C_i$, whose probability is $p$.
¹ That is, FID really takes three columns, but for readability we keep
them together under a common name in this section.

Figure 8. A UWSDT corresponding to the WSDT of Figure 5.



Intuitively, the relation C stores each value from a com- 
ponent together with its corresponding field identifier and 
the identifier of the component-tuple in the initial WSD 
(column LWID of C). The relation F contains the map- 
ping between tuple fields and component identifiers, and W 
keeps track of the worlds present for a given component. 
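The three population rules can be sketched as a single pass over the components. The input layout (a dictionary from component identifier to its field list and rows) and the function name `to_uwsd` are assumptions for this illustration; local world-ids are simply the 1-based row positions within a component.

```python
def to_uwsd(wsd):
    """Flatten a WSD into the fixed-schema relations C, F and W.

    wsd maps a component id to {"fields": [(rel, tid, attr), ...],
                                "rows": [(values, pr), ...]}.
    """
    C, F, W = set(), set(), set()
    for cid, comp in wsd.items():
        for fid in comp["fields"]:
            F.add((fid, cid))                  # F[FID, CID]
        for lwid, (values, pr) in enumerate(comp["rows"], start=1):
            W.add((cid, lwid, pr))             # W[CID, LWID, PR]
            for fid, val in zip(comp["fields"], values):
                C.add((fid, lwid, val))        # C[FID, LWID, VAL]
    return C, F, W

# Hypothetical two-component WSD; the probabilities are illustrative.
wsd = {
    "C1": {"fields": [("R", "t1", "S"), ("R", "t2", "S")],
           "rows": [((185, 186), 0.2), ((785, 185), 0.4), ((785, 186), 0.4)]},
    "C2": {"fields": [("R", "t1", "M")],
           "rows": [((1,), 0.7), ((2,), 0.3)]},
}
C, F, W = to_uwsd(wsd)
print(len(C), len(F), len(W))  # 8 3 5
```

Whatever the arity of the original components, the result always consists of the same three relations with three, two and three columns (with FID counted as one column, as in the text).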

In general, the VAL column in the component relation C 
must store values for fields of different type. One possibility 
is to store all values as strings and use casts when required. 
Alternatively, one could have one component relation for 
each data type. In both cases the schema remains fixed. 

Finally, we add template relations to UWSDs in com- 
plete analogy with WSDTs, thus obtaining the UWSDTs. 

Example 3.4 We modify the world-set represented in Figure 2
such that the marital status in $t_2$ can only have the
value 3. Figure 8 is then the uniform version of the WSDT
of Figure 5. Here $R^0$ contains the values that are the same
in all worlds. For each field that can have more than one 
possible value, R° contains a special placeholder, denoted 
by '?'. The possible values for the placeholders are defined 
in the component table C. In practice, we can expect that 
the majority of the data fields can take only one value across 
all worlds, and can be stored in the template relation. □ 

Proposition 3.5 Any finite set of possible worlds can be 
represented as a 1-UWSD and as a 1-UWSDT. 

It follows again that UWSD(T)s are a strong representa- 
tion system for any relational query language. 

4 Queries on World-set Decompositions 

In this section we study the query evaluation problem for WSDs. As
pointed out before, UWSDTs are a better representation system than
WSDs; nevertheless WSDs are simpler to explain and visualize and the
main issues regarding query evaluation are the same for both systems.

The goal of this section is to provide, for each relational algebra
query $Q$, a query $\hat{Q}$ such that for a WSD $\mathcal{W}$,

$$rep(\hat{Q}(\mathcal{W})) = \{Q(\mathcal{A}) \mid \mathcal{A} \in rep(\mathcal{W})\}.$$

Of course we want to evaluate queries directly on WSDs using $\hat{Q}$
rather than process the individual worlds using the original query $Q$.

The algorithms for processing relational algebra queries
presented next are orthogonal to whether or not the WSD 
stores probabilities. According to our semantics, a query is 
conceptually evaluated in each world and extends the world 
with the result of the query in that world. A different class 
of queries are those that close the possible world semantics 
and compute confidence of tuples in the result of a query. 
This will be the subject of Section 6.

When compared to traditional query evaluation, the eval- 
uation of relational queries on WSDs poses new challenges. 
First, since decompositions in general consist of several 
components, a query Q that maps from one WSD to another 
must be expressed as a set of queries, each of which defines 
a different component of the output WSD. Second, as cer- 
tain query operations may cause new dependencies between 
components to develop, some components may have to be 
merged (i.e., part of the decomposition undone using the product
operation $\times$). Third, the answer to a (sub)query $Q_0$ must be
represented within the same decomposition as the input relations;
indeed, we want to compute a decomposition of world set
$\{(\mathcal{A}, Q_0(\mathcal{A})) \mid \mathcal{A} \in rep(\mathcal{W})\}$
in order to be able to resort to the input relations as well as the
result of $Q_0$ within each world. Consider for example a query
$\sigma_{A=1}(R) \cup \sigma_{B=2}(R)$. If we first compute
$\sigma_{A=1}(R)$, we must not replace $R$ by $\sigma_{A=1}(R)$,
otherwise $R$ will not be available for the computation of
$\sigma_{B=2}(R)$. On the other hand, if $\sigma_{A=1}(R)$ is stored in
a separate WSD, the connection between worlds of $R$ and the selection
$\sigma_{A=1}(R)$ is lost and we can again not compute
$\sigma_{A=1}(R) \cup \sigma_{B=2}(R)$.

We say that a relation $P$ is a copy of another relation $R$ in a WSD
if $R$ and $P$ have the same tuples in every world represented by the
WSD. For a component $C$, an attribute $R.t.A_i$ of $C$ and a new
attribute $P.t.B$, the function ext extends $C$ by a new column $P.t.B$
that is a copy of $R.t.A_i$:

$$ext(C, A_i, B) := \{(A_1 : a_1, \ldots, A_n : a_n, B : a_i) \mid (A_1 : a_1, \ldots, A_n : a_n) \in C\}$$

Then copy($R$, $P$) executes $C := ext(C, R.t_i.A, P.t_i.A)$ for each
component $C$ and each $R.t_i.A \in S(C)$.
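The ext and copy operations can be sketched on a simple in-memory representation. Here a component is assumed to be a list of dictionaries that map field identifiers (relation, tuple id, attribute) to values; this layout and the name `copy_relation` are assumptions made for illustration.

```python
def ext(component, src, dst):
    """Extend every local world with a new column dst that copies column src."""
    return [{**row, dst: row[src]} for row in component]

def copy_relation(components, rel, new_rel):
    """Make new_rel a copy of rel: each field rel.t.A gets a twin new_rel.t.A
    in the same component, so both agree in every world."""
    result = []
    for comp in components:
        fields = [f for f in (comp[0] if comp else {}) if f[0] == rel]
        for (r, t, a) in fields:
            comp = ext(comp, (r, t, a), (new_rel, t, a))
        result.append(comp)
    return result

components = [
    [{("R", "t1", "A"): 1}, {("R", "t1", "A"): 2}],
    [{("S", "t1", "B"): 5}],
]
copied = copy_relation(components, "R", "P")
print(copied[0])
```

Keeping the copy inside the same component is what preserves the correlation between a world's version of R and its version of P.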

The implementation of some operations requires the composition of
components. Let $C_1$ and $C_2$ be two components with schemata
$(A_1, \ldots, A_k, Pr)$ and $(B_1, \ldots, B_l, Pr)$, respectively.
Then the composition of $C_1$ and $C_2$ is defined as:

$$compose(C_1, C_2) := \{(A_1 : a_1, \ldots, A_k : a_k, B_1 : b_1, \ldots, B_l : b_l, Pr : p_1 \cdot p_2) \mid (A_1 : a_1, \ldots, A_k : a_k, Pr : p_1) \in C_1,\ (B_1 : b_1, \ldots, B_l : b_l, Pr : p_2) \in C_2\}$$

In the non-probabilistic case the composition of components is simply
the relational product of the two components.
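A minimal sketch of compose, assuming each component is a list of dictionaries with a reserved key "Pr" and disjoint sets of other attributes:

```python
def compose(c1, c2):
    """Pair every local world of c1 with every local world of c2 and
    multiply their probabilities."""
    out = []
    for r1 in c1:
        for r2 in c2:
            row = {**r1, **r2}
            row["Pr"] = r1["Pr"] * r2["Pr"]
            out.append(row)
    return out

c1 = [{"A": 1, "Pr": 0.5}, {"A": 2, "Pr": 0.5}]
c2 = [{"B": "x", "Pr": 0.25}, {"B": "y", "Pr": 0.75}]
merged = compose(c1, c2)
print(merged[0])  # {'A': 1, 'B': 'x', 'Pr': 0.125}
```

The merged component again has probabilities summing to one, since the two inputs are independent.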

Figure 9 presents implementations of the relational algebra
operations selection (of the form $\sigma_{A\theta c}$ or
$\sigma_{A\theta B}$, where $A$ and $B$ are attributes, $c$ is a
constant, and $\theta$ is a comparison operation, $=$, $\ne$, $<$,
$\le$, $>$, or $\ge$), projection, relational product and union on
WSDs. In each case, the input WSD is extended by the result of the
operation.

Let us now have a closer look at the evaluation of relational algebra
operations on WSDs. For this, we use as running example the set of
eight worlds over the relation $R$ of Figure 10(a) and its maximal
7-WSD of Figure 10(b). The second component (from the left) of the WSD
spans over several tuples and attributes and each of the remaining six
components refers to one tuple and one attribute. The first tuple of
the second component of the WSD of Figure 10 contains the values for
$R.t_1.B$, $R.t_1.C$, and $R.t_2.B$, i.e. some but not all of the
attributes of the first and second tuple of $R^{\mathcal{A}}$, for all
worlds $\mathcal{A}$. Because of space limitations and our attempt to
keep the WSDs readable, we consistently show in the following examples
only the WSDs of the result relations.

Selection with condition A9c . In order to compute a selec- 
tion P := aA6c(R), we first compute a copy P of relation 
R and subsequently drop tuples of P that do not match the 
selection condition. 

Dropping tuples is a fairly subtle operation, since tuples 
can spread over several components and a component can 
define values for more than one tuple. 

Thus a selection must not delete tuples from component
relations, but should mark fields as belonging to deleted
tuples using the special value ⊥. To evaluate σ_{Aθc}(R), our
selection algorithm of Figure 9 checks, for each tuple t_i in
the relation P and each tuple t_C in the component C with
attribute P.t_i.A, whether t_C.(P.t_i.A) θ c holds. In the
negative case, the tuple P.t_i is marked as deleted in all
worlds that take values from t_C. For that, t_C.(P.t_i.A) is
assigned the value ⊥, and all other attributes P.t_i.A' of C
referring to the same tuple t_i of P are assigned the value ⊥
in t_C (cf. the algorithm propagate-⊥ of Figure 12). This
ensures that if we later project away the attribute A of P, we
do not erroneously "reintroduce" tuple P.t_i into worlds that
take values from t_C.
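
The marking step can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of Figure 9: a component is assumed to be a list of local worlds, each a dict from a field key (relation, tuple id, attribute) to a value, with `BOTTOM` standing in for ⊥. The names `select_const` and `propagate_bottom` and the sample values are hypothetical, and the copy of R into P is assumed to have been made already.

```python
import operator

BOTTOM = None  # assumed encoding of the special value ⊥


def propagate_bottom(component):
    """In each local world, spread ⊥ to all fields of a tuple that has a ⊥ field."""
    for row in component:
        deleted = {(rel, tid) for (rel, tid, _), v in row.items() if v is BOTTOM}
        for field in row:
            if (field[0], field[1]) in deleted:
                row[field] = BOTTOM


def select_const(components, rel, attr, theta, c):
    """Mark tuples of `rel` that violate `attr theta c` as deleted, in place."""
    for component in components:
        for row in component:
            for (r, tid, a), v in list(row.items()):
                if r == rel and a == attr and v is not BOTTOM and not theta(v, c):
                    row[(r, tid, a)] = BOTTOM
        propagate_bottom(component)


# Hypothetical component over the fields P.t1.B, P.t1.C and P.t2.B.
component = [
    {("P", 1, "B"): 1, ("P", 1, "C"): 7, ("P", 2, "B"): 3},
    {("P", 1, "B"): 2, ("P", 1, "C"): 5, ("P", 2, "B"): 4},
]
select_const([component], "P", "C", operator.eq, 7)
```

After the call, the second local world holds ⊥ for both P.t1.B and P.t1.C, while its value for P.t2.B is untouched, so t1 is absent only from the worlds that take values from that local world.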

Example 4.1 Figure 11 shows the answers to σ_{C=7}(R) and
σ_{B=1}(R). Note that the resulting WSDs should contain



algorithm select[Aθc] // compute P := σ_{Aθc}(R)
begin
  copy(R, P);
  for each 1 ≤ i ≤ |P|max do begin
    let C be the component of P.t_i.A;
    for each t_C ∈ C do
      if not (t_C.(P.t_i.A) θ c) then
        t_C.(P.t_i.A) := ⊥;
    propagate-⊥(C);
  end
end

algorithm select[AθB] // compute P := σ_{AθB}(R)
begin
  copy(R, P);
  for each 1 ≤ i ≤ |P|max do begin
    let C be the component of P.t_i.A;
    let C' be the component of P.t_i.B;
    if C ≠ C' then
      replace components C, C' by C := compose(C, C');
    for each t_C ∈ C do
      if not (t_C.(P.t_i.A) θ t_C.(P.t_i.B)) then
        t_C.(P.t_i.A) := ⊥;
    propagate-⊥(C);
  end
end

algorithm project[U] // compute P := π_U(R)
begin
  copy(R, P);
  for each 1 ≤ i ≤ |P|max do
    while no fixpoint is reached do begin
      let C be the component of P.t_i.A, where A ∈ U;
      let C' ≠ C be the component of P.t_i.B, where
        B ∉ U and ⊥ ∈ π_{P.t_i.B}(C');
      replace components C, C' by C := compose(C, C');
      propagate-⊥(C);
    end
  project away P.t_i.B from its component, for all i and all B ∉ U;
end

algorithm rename // compute δ_{A→A'}(R)
begin
  for each 1 ≤ i ≤ |R|max do begin
    let C be the component of R.t_i.A;
    rename R.t_i.A to R.t_i.A' in C;
  end
end

algorithm product // compute T := R × S
begin
  for each 1 ≤ j ≤ |S|max and R.t_i.A ∈ S(R) do begin
    let C be the component of R.t_i.A;
    C := ext(C, R.t_i.A, T.t_ij.A);
  end;
  for each 1 ≤ i ≤ |R|max and S.t_j.A ∈ S(S) do begin
    let C' be the component of S.t_j.A;
    C' := ext(C', S.t_j.A, T.t_ij.A);
  end
end

algorithm union // compute T := R ∪ S
begin
  for each 1 ≤ i ≤ |R|max and A ∈ S(R) do begin
    let C be the component of R.t_i.A;
    C := ext(C, R.t_i.A, T.(R.t_i).A);
  end;
  for each 1 ≤ j ≤ |S|max and A ∈ S(S) do begin
    let C' be the component of S.t_j.A;
    C' := ext(C', S.t_j.A, T.(S.t_j).A);
  end
end

algorithm difference // compute P := R − S
begin
  copy(R, P);
  for each 1 ≤ i ≤ |P|max do
    for each 1 ≤ j ≤ |S|max do begin
      let C_1, ..., C_k be the components for the fields of P.t_i and S.t_j;
      replace C_1, ..., C_k by C := compose(C_1, ..., C_k);
      for each t_C ∈ C do begin
        if t_C.(P.t_i.A) = t_C.(S.t_j.A) for all A ∈ S(R) then
          t_C.(P.t_i.A) := ⊥;
      end
    end
end

Figure 9. Evaluating relational algebra operations on WSDs.
(a) Set of eight worlds of the relation R.
(b) 7-WSD of the world-set of (a).
Figure 10. World-set and its decomposition.

(a) P := σ_{C=7}(R) applied to the WSD of Figure 10.
Figure 11. Selections P := σ_{C=7}(R) and P := σ_{B=1}(R) with R from Figure 10.



algorithm propagate-⊥(C: component)
begin
  for each t_C ∈ C and P.t_i.A ∈ S(C) do
    if t_C.(P.t_i.A) = ⊥ then
      for each A' such that P.t_i.A' ∈ S(C) do
        t_C.(P.t_i.A') := ⊥;
end

Figure 12. Propagating ⊥-values.

both the query answer P and the original relation R, but due
to space limitations we only show the representation of P.
One can observe that for both results in Figure 11 we obtain
worlds of different sizes. For example, the worlds that take
values from the first tuple of the second component relation
in Figure 11(a) do not have a tuple t_1, while the worlds
that take values from the second tuple of that component
relation contain t_1. □

Selection with condition AθB. The main added difficulty
of selections with conditions AθB, as compared to selections
with conditions Aθc, is that they create dependencies between
two attributes of a tuple, which do not necessarily
reside in the same component.

As the current decomposition may not capture exactly
the combinations of values satisfying the join condition,
components that have values for A and B of the same tuple
are composed. After the composition phase, the selection
algorithm follows the pattern of the selection with constant.
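
A minimal Python sketch of this compose-then-select pattern for a single tuple id is given below. It assumes a component is a list of local worlds (dicts keyed by (relation, tuple id, attribute)) and encodes ⊥ as `BOTTOM`. The function names and sample data are hypothetical, and the full algorithm would repeat the step for every tuple id of P.

```python
import operator
from itertools import product

BOTTOM = None  # assumed encoding of ⊥


def compose(c1, c2):
    """Relational product of two components (lists of local worlds)."""
    return [{**r1, **r2} for r1, r2 in product(c1, c2)]


def select_attr(components, rel, tid, a, b, theta):
    """Evaluate `a theta b` for tuple `tid` of `rel`; returns the new component list."""
    fa, fb = (rel, tid, a), (rel, tid, b)
    ca = next(c for c in components if fa in c[0])
    cb = next(c for c in components if fb in c[0])
    rest = [c for c in components if c is not ca and c is not cb]
    # Compose only if A and B live in different components.
    merged = [dict(r) for r in ca] if ca is cb else compose(ca, cb)
    for row in merged:
        va, vb = row[fa], row[fb]
        if va is BOTTOM or vb is BOTTOM or not theta(va, vb):
            for field in row:
                if field[:2] == (rel, tid):  # propagate ⊥ within the tuple
                    row[field] = BOTTOM
    return rest + [merged]


comp_a = [{("P", 1, "A"): 1}, {("P", 1, "A"): 2}]
comp_b = [{("P", 1, "B"): 1}, {("P", 1, "B"): 3}]
result = select_attr([comp_a, comp_b], "P", 1, "A", "B", operator.eq)
```

The two single-field components are replaced by one composed component with four local worlds, of which only the combination A = 1, B = 1 keeps the tuple.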



Example 4.2 Consider the query σ_{A=B}(R), where R is
represented by the 7-WSD of Figure 10. Figure 13 shows
the query answer, which is a 4-WSD that represents five
worlds, where one world has three tuples, three worlds have
two tuples each, and one world has one tuple. □

Product. The product T := R × S of two relations R and
S, which have disjoint attribute sets and are represented by
a WSD, requires that the product relation T extends a
component C with |S|max (respectively |R|max) copies of each
column of C with values of R (respectively S). Additionally,
the jth (respectively ith) copy is named T.t_ij.A if the
original has name R.t_i.A (respectively S.t_j.A).

Example 4.3 Figure 14(b) shows the WSD for the product
of relations R and S represented by the WSD of
Figure 14(a). To save space, the relations R and S have been
removed from Figure 14(b), and attribute names do not
show the relation name "T". □

Projection. A projection P = π_U(R) on an attribute set
U of a relation R represented by the WSD C is translated
into (1) the extension of C with the copy P of R, and (2)
projections on the components of C, where all component
attributes that do not refer to attributes of P in U are
discarded. Before removing attributes, however, we need to
propagate ⊥-values, as discussed in the following example.

Example 4.4 Consider the 3-WSD of Figure 15(a)
representing a set of two worlds for R, where one world contains
only the tuple t_1 and the other contains only the tuple t_2. Let
P' represent the first two components of R, which contain
all values for the attribute A in both tuples. The relation P'
is not the answer to π_A(R), because it encodes one world
with both tuples, and the information from the third
component of R that only one tuple appears in each world is
lost. To compute the correct answer, we progressively (1)
compose the components referring to the same tuple (in this
case all three components), (2) propagate ⊥-values within
the same tuple, and (3) project away the irrelevant attributes.
The correct answer P is given in Figure 15(b). □

Figure 13. P = σ_{A=B}(R) with R from Figure 10.

(a) WSD of two relations R and S.
(b) WSD of their product R × S.
Figure 14. The product operation R × S.

(a) WSD for R. (b) WSD for P.
Figure 15. Projection P := π_A(R).

The algorithm for projection is given in Figure 9. For
each tuple t_i, attribute A in the projection list, and attribute
B not in the projection list, the algorithm first propagates
the ⊥-values of P.t_i.B of component C' to P.t_i.A of
component C. If C and C' are the same, the propagation is
done locally within the component. Otherwise, C and C'
are merged before the propagation. Note that the propagation
is only needed if some tuples of C' have a ⊥-value
for t_i.B. This procedure is repeated until no further
components C and C' exist that satisfy the above criteria. After
the propagation phase, the attributes not in the projection
list are dropped from all remaining components.
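
The compose, propagate and project steps can be illustrated with a small Python sketch. It makes simplifying assumptions: components are lists of local worlds (dicts keyed by (relation, tuple id, attribute)), ⊥ is encoded as `BOTTOM`, and all components holding fields of the relation are composed up front rather than only those across which a ⊥-value must travel. The function names and the sample values are hypothetical, loosely modeled on the two-world situation of Example 4.4.

```python
from itertools import product

BOTTOM = None  # assumed encoding of ⊥


def compose(*components):
    """Relational product of components (lists of local worlds)."""
    return [{k: v for row in rows for k, v in row.items()}
            for rows in product(*components)]


def propagate_bottom(component):
    """Spread ⊥ to all fields of a tuple that has a ⊥ field in a local world."""
    for row in component:
        deleted = {f[:2] for f, v in row.items() if v is BOTTOM}
        for field in row:
            if field[:2] in deleted:
                row[field] = BOTTOM


def project(components, rel, keep):
    """Keep only the attributes in `keep` for relation `rel`."""
    touched = [c for c in components if any(f[0] == rel for f in c[0])]
    rest = [c for c in components if not any(f[0] == rel for f in c[0])]
    merged = compose(*touched)       # simplification: compose everything
    propagate_bottom(merged)         # must happen before attributes are dropped
    projected = [{f: v for f, v in row.items() if f[0] != rel or f[2] in keep}
                 for row in merged]
    return rest + [projected]


c1 = [{("R", 1, "A"): "a1"}]
c2 = [{("R", 2, "A"): "a2"}]
c3 = [{("R", 1, "B"): "b1", ("R", 2, "B"): BOTTOM},
      {("R", 1, "B"): BOTTOM, ("R", 2, "B"): "b2"}]
answer = project([c1, c2, c3], "R", {"A"})
```

The result is a single component with two local worlds, one containing only t1 and one containing only t2. Dropping the B column without the propagation step would instead yield one world with both tuples.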

Union. The algorithm for computing the union T := R ∪ S
of two relations R and S works similarly to that for the
product. Each component C containing values of R or S is
extended such that in each world of C all values of R and S
also become values of T.

Renaming. The operation δ_{A→A'}(R) renames attribute A
of relation R to A' by renaming all attributes R.t.A in a
component C to R.t.A'.

Difference. To compute the difference P := R − S,
we scan and compose components of the two relations
R and S. For the worlds where a tuple t from R
matches some tuple from S, we place ⊥-values to denote
that t is not in these worlds of P; otherwise t becomes a
tuple of P. The difference is by far the least efficient
operation to implement, as it can lead to the composition of all
components in the WSD.

5 Efficient Query Evaluation on UWSDTs 

The algorithms for computing the relational operations
on WSDs presented in Section 4 can be easily adapted to
UWSDTs. To do this, we closely follow the mapping of
WSDs, represented as sets of components C, to equivalent
UWSDTs, represented by a triple (F, C, W) and at least one
template relation R^0:

• Consider a component K of WSD C having an attribute
  R.t.A with a value v. In the equivalent UWSDT, this value
  can be stored in the template relation R^0 if v is the only
  value of R.t.A, or in the component relation C otherwise.
  In the latter case, the template R^0 contains the
  placeholder R.t.A in the tuple t. In addition, in the mapping
  relation F there is an entry with the placeholder R.t.A and
  a component identifier c, and C contains a tuple formed by
  R.t.A, the value v and a world identifier w.

• Worlds of different sizes are represented in WSDs by
  allowing ⊥ values in components, and in UWSDTs by
  allowing the same placeholder to have a different number
  of values in different worlds.

Any relational query is rewritten in our framework to a
sequence of SQL queries, except for the projection and
selection with join conditions, where the fixpoint
computations are encoded as recursive PL/SQL programs. In all
cases, the size of the rewriting is linear in the size of the
input query. Figure 16 shows the implementation of the
selection with constant on UWSDTs.

Figure 16. Evaluating P := σ_{Aθc}(R) on UWSDTs.

In contrast to some algorithms of Figure 9, for UWSDTs
we do not create a copy P of R at the beginning, but rather
compute P directly from R using standard relational algebra
operators. The template P^0 is initially the set of tuples
of R^0 that satisfy the selection condition, or have a
placeholder '?' for the attribute A (line 1). We extend the
mapping relation F with the placeholders of P^0 (line 2), and the
component relation C with the values of these placeholders,
where the values of placeholders P.t.A for the attribute A
must satisfy the selection condition (line 3). If a placeholder
P.t.A has no value satisfying the selection condition, then
t is removed from P^0 (line 6) and all placeholders of t are
removed from F (line 5) together with their values from C
(line 4).

Many of the standard query optimization techniques are
also applicable in our context. For our experiments reported
in Section 8, we performed the following optimizations on
the sequences of SQL statements obtained as rewritings.
For the evaluation of a query involving a join, we merge the
product and the selections with join conditions and
distribute projections and selections to the operands. When
evaluating a query involving several selections and
projections on the same relation, we again merge these operators
and perform the steps of the algorithm of Figure 16 only
once. We further tuned the query evaluation by employing
indices and materializing frequently used temporary results.

6 Confidence Computation in Probabilistic 
WSDs 

Section 4 discussed algorithms for evaluating relational
algebra queries on top of WSDs. Since we consider queries 
that transform worlds, the algorithms were independent of 
whether or not probabilities were stored with the data. A 
different class of queries are ones that compute confidence 
of tuples. The confidence of a tuple t in the result of a query 
Q is defined as the sum of the probabilities of the worlds 
that contain t in the answer to Q. Clearly, iterating over 
all possible worlds is infeasible. We therefore adopt an ap- 
proach where we only iterate over the local worlds of the 
relevant components. 



// compute the confidence of tuple t
algorithm conf(t)
begin
  c := 0;
  let t_1, ..., t_k be the tuple ids
    that match t in some world;
  let C_1, ..., C_n be the components
    for the fields of t_1, ..., t_k;
  let C := compose(C_1, ..., C_n);
  for each t_C in C do begin
    if t = (t_C.(t_i.A_1), ..., t_C.(t_i.A_m)) for some i
    then c := c + t_C.Pr;
  end
  return c;
end

Figure 17. Computing confidence of possible tuples.

Figure 17 shows an algorithm for computing the confidence
of a tuple t of schema (A_1, ..., A_m). It first finds those
tuple ids t_1, ..., t_k that match the given tuple t in some
world and composes all components defining fields of those
tuple ids into one component C. A world that contains t is
thus obtained whenever we select a local world from C that
makes the value of some tuple id t_i, 1 ≤ i ≤ k, equal to
t. Fixing a local world in C defines a set of possible worlds,
namely the ones that share the values specified by the selected
local world. The probability of this set of worlds is given in
the Pr field of the local world. Since the local worlds of a
component define non-overlapping sets of worlds, to compute
the confidence of t we need to sum up the probabilities
of those local worlds that define t.
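
The following Python sketch illustrates the summation over local worlds. It is a simplified reading of Figure 17, not a faithful implementation: a probabilistic component is assumed to be a list of (local world, probability) pairs, and the sketch composes all components holding fields of the relation rather than only those of the matching tuple ids. The names `compose` and `conf` and the data are hypothetical.

```python
from itertools import product
from math import prod


def compose(*components):
    """Product of probabilistic components; probabilities multiply."""
    composed = []
    for combo in product(*components):
        row = {}
        for local_world, _ in combo:
            row.update(local_world)
        composed.append((row, prod(pr for _, pr in combo)))
    return composed


def conf(components, rel, attrs, t):
    """Sum the probabilities of composed local worlds in which some tuple equals t."""
    relevant = [c for c in components if any(f[0] == rel for f in c[0][0])]
    total = 0.0
    for row, pr in compose(*relevant):
        tids = {f[1] for f in row if f[0] == rel}
        if any(tuple(row.get((rel, tid, a)) for a in attrs) == t for tid in tids):
            total += pr
    return total


# Hypothetical data: one component correlating R.t1.S and R.t2.S,
# and an independent component for R.t1.M.
c1 = [({("R", 1, "S"): 185, ("R", 2, "S"): 186}, 0.2),
      ({("R", 1, "S"): 785, ("R", 2, "S"): 185}, 0.4),
      ({("R", 1, "S"): 785, ("R", 2, "S"): 186}, 0.4)]
c2 = [({("R", 1, "M"): 1}, 0.7), ({("R", 1, "M"): 2}, 0.3)]
confidence = conf([c1, c2], "R", ("S",), (185,))
```

With this data the confidence of (185) is 0.2 + 0.4 = 0.6 up to floating-point rounding. The independent component c2 does not change the result, which is why composing it is unnecessary work that the algorithm of Figure 17 avoids.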

Note that the algorithms for computing tuple confidence
in [12] rely heavily on the fact that input tuples are
independent. Tuple confidence is computed during the evaluation of
the query in question to avoid having to store intermediate
results. This restricts the supported types of queries and the
query plans that can be used. In probabilistic WSDs, on the
other hand, query evaluation can be completely decoupled
from confidence computation, since probabilistic WSDs form a
strong representation system. For the same reason, we need
no independence assumptions about the input data.

algorithm select[Aθc] // compute P := σ_{Aθc}(R)
begin
  1. P^0 := σ_{Aθc ∨ A=?}(R^0);
  2. F := F ∪ {(P.t.B, k) | (R.t.B, k) ∈ F, t ∈ P^0};
  3. C := C ∪ {(P.t.B, w, v) | (R.t.B, w, v) ∈ C, t ∈ P^0,
               (B = A ⇒ v θ c)};
  // Remove incomplete world tuples
  4. C := C − {(P.t.X, w, v) ∈ C | (P.t.X, k), (P.t.Y, k) ∈ F,
               t ∈ P^0, X ≠ Y, ∄v' : (P.t.Y, w, v') ∈ C};
  5. F := F − {(P.t.B, k) | (P.t.B, k) ∈ F,
               ∄w, v : (P.t.B, w, v) ∈ C};
  6. P^0 := P^0 − {t | t ∈ P^0, ∄B, k : (P.t.B, k) ∈ F};
end

The algorithm of Figure 17 does not exploit possible
independence between tuples. One can design a better
approach in the following way. In a probabilistic WSD,
each component id corresponds to an independent random
variable, whose possible outcomes are the local worlds of
the component. We call a world-set descriptor
(ws-descriptor) a set

$$\{(C_1, L_1), \ldots, (C_n, L_n)\}$$

where C_i is a component id, L_i is a local world id of C_i,
and no two elements (C_i, L_i), (C_j, L_j) of the set exist with
C_i = C_j and L_i ≠ L_j. A ws-descriptor defines, as its name
suggests, a set of possible worlds, whose probability can be
computed as the product of the probabilities of the selected
local worlds:

$$P(\{(C_1, L_1), \ldots, (C_n, L_n)\}) = \prod_{i=1}^{n} P(C_i, L_i)$$

A ws-descriptor that specifies a local world for each
component id of a probabilistic WSD corresponds to a single
world. For computing tuple confidence we also need to
consider sets of ws-descriptors. A ws-descriptor set defines a
set of possible worlds, namely the union of the worlds defined
by each descriptor in the set. Given a fixed tuple t and a
probabilistic world-set decomposition W representing the answer
R to query Q, we compute a ws-descriptor set D for the
worlds containing t in the following way. Let t_i be a tuple id
of R and C_{i_1}, ..., C_{i_j} be the components of W that define
fields of R.t_i. If the value of t_i is t when we fix the local
world of C_{i_k} to be L_{i_k} for 1 ≤ k ≤ j, respectively, then
D contains the ws-descriptor {(C_{i_1}, L_{i_1}), ..., (C_{i_j}, L_{i_j})}.
The confidence of t is then computed as the probability of
the worlds defined by D. Computing tuple confidence can
be reduced to computing the probability of a formula in
disjunctive normal form, which is known to be #P-hard.
This follows from the mutual reducibility of the problem
of computing the probability of the union of the (possibly
overlapping) world-sets represented by a set of
ws-descriptors and of the #P-complete problem of counting the
number of satisfying assignments of Boolean formulas in
disjunctive normal form. Indeed, we can encode a set of k
ws-descriptors {(C_{i_1}, L_{i_1}), ..., (C_{i_j}, L_{i_j})}, 1 ≤ i ≤ k,
as the formula

$$\bigvee_{1 \le i \le k} \left( C_{i_1} = L_{i_1} \wedge \ldots \wedge C_{i_j} = L_{i_j} \right)$$

Different optimization techniques exist for computing the
probability of a Boolean formula, such as variable elimination and
Monte Carlo approximations [19].
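
To make the two probabilities concrete, here is a small Python sketch. It assumes a hypothetical dictionary `P` that maps each component id to the probabilities of its local worlds, and it represents a ws-descriptor as a dict from component id to local world id. The exact probability of a descriptor set is computed by brute-force enumeration over the components involved, which is exponential in their number and is meant only to illustrate the definition, not the optimization techniques mentioned above.

```python
from itertools import product
from math import prod


def descriptor_prob(descriptor, P):
    """Probability of one ws-descriptor: product over its (component, local world) pairs."""
    return prod(P[c][l] for c, l in descriptor.items())


def descriptor_set_prob(descriptors, P):
    """Probability of the union of the world-sets of several ws-descriptors."""
    comps = sorted({c for d in descriptors for c in d})
    total = 0.0
    for choice in product(*(P[c].items() for c in comps)):
        world = {c: l for c, (l, _) in zip(comps, choice)}
        # The partial world counts if it satisfies at least one descriptor.
        if any(all(world[c] == l for c, l in d.items()) for d in descriptors):
            total += prod(p for _, p in choice)
    return total


# Hypothetical probabilities for two independent components.
P = {"C1": {1: 0.2, 2: 0.8}, "C2": {1: 0.5, 2: 0.5}}
single = descriptor_prob({"C1": 1, "C2": 1}, P)
union = descriptor_set_prob([{"C1": 1}, {"C2": 1}], P)
```

Here the two descriptors overlap in the worlds where both C1 and C2 take local world 1, so the union has probability 0.2 + 0.5 − 0.1 = 0.6 rather than 0.7.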

Remark 6.1 The U-relations of [5] associate each possible
combination of values with a ws-descriptor. In WSDs and 
UWSDTs on the other hand a combination of values is as- 
sociated with a single pair of component and local world id. 
Thus WSDs form a special case of U-relations with depen- 
dency vectors of size one. □ 

We next consider the operator possible that computes the
tuples appearing in at least one world of the world-set.
Formally, if R is a relation name and $\mathbf{A}$ is a
world-set, the operator possible is defined as:

$$\mathit{possible}(R)(\mathbf{A}) := \{\, t \mid \mathcal{A} \in \mathbf{A},\ t \in R^{\mathcal{A}} \,\}$$



// compute P := possible(R)
algorithm possible
begin
  P := ∅;
  for each 1 ≤ i ≤ |R|max do begin
    let C_1, ..., C_k be the components for
      the fields of R.t_i;
    let C := compose(C_1, ..., C_k);
    add π_{R.t_i.A_1, ..., R.t_i.A_m}(σ_{⋀_j R.t_i.A_j ≠ ⊥}(C)) to P;
  end
end

Figure 18. Computing possible tuples.

Figure 18 shows an algorithm for computing possible
tuples in the non-probabilistic case. For each tuple id t_i for R,
we compose the components defining fields of t_i to obtain
the possible values for t_i.
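
A compact Python sketch of this per-tuple composition is shown below. It assumes components are lists of local worlds (dicts keyed by (relation, tuple id, attribute)), that ⊥ is encoded as `BOTTOM`, and that every field of every tuple id is defined in some component. The function name and data are hypothetical.

```python
from itertools import product

BOTTOM = None  # assumed encoding of ⊥


def possible(components, rel, attrs):
    """Set of tuples of `rel` that occur in at least one world."""
    tids = sorted({f[1] for c in components for f in c[0] if f[0] == rel})
    result = set()
    for tid in tids:
        fields = [(rel, tid, a) for a in attrs]
        # Compose only the components that define fields of this tuple id.
        relevant = [c for c in components if any(f in c[0] for f in fields)]
        for combo in product(*relevant):
            row = {}
            for local_world in combo:
                row.update(local_world)
            values = tuple(row[f] for f in fields)
            if all(v is not BOTTOM for v in values):  # skip deleted tuples
                result.add(values)
    return result


c1 = [{("R", 1, "A"): 1}, {("R", 1, "A"): 2}]
c2 = [{("R", 1, "B"): 5, ("R", 2, "A"): 3, ("R", 2, "B"): 5},
      {("R", 1, "B"): 6, ("R", 2, "A"): BOTTOM, ("R", 2, "B"): BOTTOM}]
tuples = possible([c1, c2], "R", ("A", "B"))
```

Tuple id t1 contributes the four combinations of its A and B values, while t2 contributes only (3, 5) because it is absent from the worlds using the second local world of c2.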



// compute P := possible_p(R)
algorithm possible_p
begin
  P := ∅;
  for each distinct t in possible(R) do begin
    add (t, conf(t)) to P;
  end
end

Figure 19. Computing possible tuples together with their confidence.

In the probabilistic case, the operator possible can be
extended to compute the confidence of the possible tuples. To
do that, we compute the confidence of each tuple t that
is a possible answer to Q. Figure 19 shows an algorithm
implementing the operator possible in the probabilistic case
that computes the possible tuples together with their
confidence. For computing the confidence conf(t) of tuple t we
can plug in any exact or approximate algorithm, e.g. the one
from Figure 17.

algorithm remove_invalid_tuples
begin
  for each 1 ≤ i ≤ |P|max and A ∈ S(P) do begin
    let C be the component of P.t_i.A;
    if π_{P.t_i.A}(C) = {⊥} then
      for each B ∈ S(P) do begin
        let C' be the component of P.t_i.B;
        project away P.t_i.B from C';
      end
  end
end

algorithm decompose
begin
  while no fixpoint is reached do begin
    let C be a component such that
      C = compose(C_1, C_2);
    replace C by C_1, C_2;
  end
end

algorithm compress
begin
  while no fixpoint is reached do begin
    let C be a component, w_1, w_2 ∈ C such that
      w_1.A = w_2.A for all A ∈ S(C), A ≠ Pr;
    let w be a tuple such that w.Pr := w_1.Pr + w_2.Pr,
      w.A := w_1.A for all A ∈ S(C), A ≠ Pr;
    replace w_1, w_2 in C by w;
  end
end

Example 6.2 Consider the probabilistic WSD of Figure 4,
the query Q = π_S(R), and the tuple t = (185). Let C_1 denote
the first component. This component represents the answer to
the projection query. There are two tuple ids whose values
match the given tuple t, and they are already defined in the
same component C_1. To compute the confidence of t we
therefore need to sum up the probabilities of the first and
second local world, obtaining 0.2 + 0.4 = 0.6. The
following table contains the possible tuples in the answer to Q
together with their confidence:

  Q   S     conf
      185   0.6
      186   0.6
      785   0.8
□

7 Normalizing probabilistic WSDs 

The normalization of a probabilistic WSD is the process of
finding an equivalent probabilistic WSD that takes the least
space among all its equivalents. Examples of non-normalized
WSDs are non-maximal WSDs or WSDs defining invalid
tuples (i.e., tuples that do not appear in any world). Note
that removing invalid tuples and maximizing world-set
decompositions can be performed in polynomial time [6].

Figure 20 gives three algorithms that address these
normalization problems. The third algorithm scans for
identical tuples in a component and compresses them into one by
summing up their probabilities.
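
The compression step lends itself to a short Python sketch. It assumes a probabilistic component is a list of (local world, probability) pairs with hashable field values, and it groups identical local worlds in one pass instead of iterating pairwise to a fixpoint. The function name and data are hypothetical.

```python
def compress(component):
    """Merge identical local worlds of a component, summing their probabilities."""
    merged = {}
    for row, pr in component:
        key = frozenset(row.items())  # local worlds are equal iff all fields agree
        merged[key] = merged.get(key, 0.0) + pr
    return [(dict(key), pr) for key, pr in merged.items()]


component = [
    ({("P", 1, "A"): 1}, 0.2),
    ({("P", 1, "A"): 2}, 0.5),
    ({("P", 1, "A"): 1}, 0.3),
]
compact = compress(component)
```

The first and third local worlds agree on every field and are merged into one local world, so the compressed component has two local worlds whose probabilities still sum to 1.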

Example 7.1 The WSD of Figure 11(a) has only ⊥-values
for P.t_2.C. This means that the tuple t_2 of P is absent (or
invalid) in all worlds and can be removed. The equivalent
WSD of Figure 21 shows the result of this operation. Similar
simplifications apply to the WSD of Figure 11(b), where
tuples t_2 and t_3 are invalid. □

Example 7.2 The 4-WSD of Figure 13 admits the equivalent
5-WSD, where the third component is decomposed into
two components. This non-maximality case cannot appear
for UWSDTs, because all but the first component contain
only one tuple and are stored in the template relation, where
no component merging occurs. □



Figure 20. Algorithms for WSD normalization.

Figure 21. Normalization of the WSD of Figure 11(a).



8 Experimental Evaluation 

The literature offers a number of approaches to representing
incomplete information databases, but little work
has been done so far on expressive yet efficient representation
systems. An ideal representation system would allow
a large set of possible worlds to be managed using only a
small overhead in storage space and query processing time
when compared to a single world represented in a conventional
way. In the previous sections we presented the first
step towards this goal. This section reports on experiments
with a large census database with noise represented as a
UWSDT.

Setting. The experiments were conducted on a 3GHz/2GB
Pentium machine running Linux 2.6.8 and PostgreSQL 8.0.

Datasets. The IPUMS 5% census data (Integrated Public
Use Microdata Series, 1990) [20] used for the experiments
is the publicly available 5% extract from the 1990 US
census, consisting of 50 (exclusively) multiple-choice
questions. It is a relation with 50 attributes and 12491667 tuples
(approx. 12.5 million). The size of this relation stored in
PostgreSQL is ca. 3 GB. We also used excerpts representing
the first 0.1, 0.5, 1, 5, 7.5, and 10 million tuples.

Adding Incompleteness. We added incompleteness as follows.
First, we generated a large set of possible worlds by
introducing noise. After that, we cleaned the data by
removing worlds inconsistent with respect to a given set of
dependencies. Both steps are detailed next.

We introduced noise by replacing some values with
or-sets². We experimented with different noise densities:
0.005%, 0.01%, 0.05%, 0.1%. For example, in the 0.1%
scenario one in 1000 fields is replaced by an or-set. The
size of each or-set was randomly chosen in the range
[2, min(8, size)], where size is the size of the domain of
the respective attribute (with a measured average of 3.5
values per or-set). In one scenario we had far more than
2^624449 worlds, where 624449 is the number of the introduced
or-sets and 2 is the minimal size of each or-set (cf. Figure 22).

We then performed data cleaning using 12 equality gen- 
erating dependencies, representing real-life constraints on 
the census data. Note that or-set relations are not expressive 
enough to represent the cleaned data with dependencies. 

To remove inconsistent worlds with respect to given
dependencies, we adapted the Chase technique [2] to the
context of UWSDTs. We explain the Chase by an example.
Consider the dependency WWII = 1 ⇒ MILITARY ≠ 4
that requires people who participated in the Second World
War to have completed their military service. Assume now
that the dependency does not hold for a tuple t in some world,
and let C_1 and C_2 be the components defining t.WWII
and t.MILITARY, respectively. First, the Chase computes
a component C that defines both t.WWII and t.MILITARY.
If C_1 and C_2 are different, they are replaced by a new
component C = C_1 × C_2; otherwise, C is C_1. The Chase
then removes from C all inconsistent worlds w, i.e., worlds
where w.WWII = 1 and w.MILITARY = 4. Repeating these
steps iteratively for each dependency on a given UWSDT
yields a UWSDT satisfying all dependencies.
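
One chase step for a single tuple and a single dependency of this shape can be sketched in Python as follows. This is an illustration under assumptions, not the implementation used in the experiments: components are plain lists of local worlds (dicts keyed by (relation, tuple id, attribute)) rather than the relational encoding of UWSDTs, and the function and parameter names are hypothetical.

```python
from itertools import product


def chase_step(components, tid, if_attr, if_val, then_attr, bad_val, rel="R"):
    """Enforce `if_attr = if_val => then_attr != bad_val` for one tuple."""
    f1, f2 = (rel, tid, if_attr), (rel, tid, then_attr)
    c1 = next(c for c in components if f1 in c[0])
    c2 = next(c for c in components if f2 in c[0])
    # Nothing to do if no combination of local worlds can violate the dependency.
    if not (any(w[f1] == if_val for w in c1) and any(w[f2] == bad_val for w in c2)):
        return components
    rest = [c for c in components if c is not c1 and c is not c2]
    merged = c1 if c1 is c2 else [{**w1, **w2} for w1, w2 in product(c1, c2)]
    consistent = [w for w in merged if not (w[f1] == if_val and w[f2] == bad_val)]
    return rest + [consistent]


c_wwii = [{("R", 1, "WWII"): 1}, {("R", 1, "WWII"): 0}]
c_mil = [{("R", 1, "MILITARY"): 4}, {("R", 1, "MILITARY"): 2}]
chased = chase_step([c_wwii, c_mil], 1, "WWII", 1, "MILITARY", 4)
```

The two single-field components are merged into one component with three local worlds, the combination WWII = 1 and MILITARY = 4 having been removed. The early return keeps components separate whenever the dependency cannot be violated, which limits the amount of merging.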

Figure 22 shows the effect of chasing our dependencies
on the 12.5 million tuples with varying placeholder density.
As a result of merging components, the number of
components with more than one placeholder (#comp>1) grows
linearly with the increase of placeholder density, reaching
about 1.7% of the total number of components (#comp) in
the 0.1% case. The chasing time also increases linearly
when the number of tuples is varied.

2 We consider it infeasible both to iterate over all worlds in secondary
storage and to compute UWSDT decompositions by comparing the worlds.

Figure 22. UWSDTs characteristics for 12.5M tuples.

Queries. Six queries were chosen to show the behavior of
combinations of relational operators under varying
selectivities (cf. Figure 23). Query Q1 returns the entries of US
citizens with a PhD degree. The less selective query Q2 returns
the place of birth of US citizens born outside the US who
do not speak English well. Query Q3 retrieves the entries
of widows who have more than three children and live in
the state where they were born. The very unselective query
Q4 returns all married persons having no children. Query
Q5 uses queries Q2 and Q3 to find all possible couples of
widows with many children and foreigners with limited
English language proficiency in US states with IPUMS index
greater than 50 (i.e., eight 'states', e.g., Washington,
Wisconsin, Abroad). Finally, query Q6 retrieves the places of
birth and work of persons speaking English well.

Figure 22 describes some characteristics of the answers
to these queries when applied on the cleaned 12.5M tuples
of IPUMS data: the total number of components
(#comp) and of components with more than one placeholder
(#comp>1), the size of the component relation C, and the
size of the template relation R. One can observe that the
number of components increases linearly with the
placeholder density and that, compared to chasing, query
evaluation leads to a much smaller amount of component merging.

Q1 := σ_{YEARSCH=17 ∧ CITIZEN=0}(R)
Q2 := π_{POWSTATE,CITIZEN,IMMIGR}(σ_{CITIZEN<>0 ∧ ENGLISH>3}(R))
Q3 := π_{POWSTATE,MARITAL,FERTIL}(σ_{POWSTATE=POB}(σ_{FERTIL>4 ∧ MARITAL=1}(R)))
Q4 := σ_{FERTIL=1 ∧ (RSPOUSE=1 ∨ RSPOUSE=2)}(R)
Q5 := δ_{POWSTATE→P1}(σ_{POWSTATE>50}(Q2)) ⋈_{P1=P2} δ_{POWSTATE→P2}(σ_{POWSTATE>50}(Q3))
Q6 := π_{POWSTATE,POB}(σ_{ENGLISH=3}(R))

Figure 23. Queries on IPUMS census data.

Figure 24. The evaluation time for queries of Figure 23 on UWSDTs of various sizes and densities.

Figure 24 shows that all six queries admit efficient and
scalable evaluation on UWSDTs of different sizes and
placeholder densities. For accuracy, each query was run ten
times, and the median time for computing and storing the
answer is reported. The evaluation time for all queries but
Q5 on UWSDTs follows very closely the evaluation time
in the one-world case. The one-world case corresponds to
density 0% in our diagrams, i.e., when no placeholders are
created in the template relation and consequently there are
no components. In this case, the original queries (that is,
not the rewritten ones) of Figure 23 were evaluated only on
the (complete) template relation.

An interesting observation is that all diagrams of Figure 24
show a substantial increase in the query evaluation time
for the 7.5M case. As the jump also appears in the
one-world case, it suggests poor memory management of
Postgres in the case of large tables. We verified this by
splitting the 12.5M table into chunks smaller than 5M and
running query Q1 on those chunks to get partial answers.
The final answer is then represented by the union of the
UWSDT relations from these partial answers.

Although the evaluation of join conditions on UWSDTs
can in theory require exponential time (due to the
composition of some components), our experiments suggest that
they behave well in practical cases, as illustrated in
Figures 24(c) and (e) for queries Q3 and Q5, respectively. Note
that the time reported for Q5 does not include the time to
evaluate its subqueries Q2 and Q3.

In summary, our experiments show that UWSDTs be- 
have very well in practice. We found that the size of 
UWSDTs obtained as query answers remains close to that 
of one of their worlds. Furthermore, the processing time 
for queries on UWSDTs is comparable to processing one 
world. The explanation for this is that in practice there are 
rather few differences between the worlds. This keeps the 
mapping and component relations relatively small and the 
lion's share of the processing time is taken by the templates, 
whose sizes are about the same as of a single world. 



9 Application Scenarios 

Our approach is designed to cope with large sets of possible
worlds, which exhibit local dependencies and large
commonalities. This data pattern can be found in many
applications. In addition to the census scenario used in Section
8, we next discuss two further application scenarios that can
profit from our approach. As for the census scenario, we
consider it infeasible both to iterate over all possible worlds
in secondary storage and to compute UWSDT decompositions
by comparing the worlds. Thus we also outline how
our UWSDTs can be efficiently computed.

Inconsistent databases. A database is inconsistent if it
does not satisfy given integrity constraints. Sometimes,
enforcing the constraints is undesirable. One approach to
managing such inconsistency is to consider so-called minimal
repairs, i.e., consistent instances of the database obtained
with a minimal number of changes [7]. A repair can therefore
be viewed as a possible (consistent) world. The number
of possible minimal repairs of an inconsistent database
may in general be exponential; however, they substantially
overlap. For that reason repairs can be easily modeled
with UWSDTs, where the consistent part of the database
is stored in template relations and the differences between
the repairs in components. Current work on inconsistent
databases [7] focuses on finding consistent query answers,
i.e., answers appearing in all possible repairs (worlds). With
our approach we can provide more than that, as the answer
to a query represents a set of possible worlds. In this way,
we preserve more information that can be further processed
using querying or data cleaning techniques.

Medical data. Another application scenario is modeling
information on medications, diseases, symptoms, and medical
procedures, see, e.g., [1]. A particular characteristic of such
data is that it contains a large number of clusters of
interdependent data. For example, some medications can interact
negatively and are not approved for patients with some
diseases. Particular medical procedures can be prescribed for
some diseases, while they are forbidden for others. In the
large set of possible worlds created by the complex
interaction of medications, diseases, procedures, and symptoms, a
particular patient record can represent one or a few possible
worlds. Our approach can keep interdependent data within
components and independent data in separate components.
One can then ask for possible patient diagnoses, given an
incompletely specified medical history of the patient, or for
commonly used medication for a given set of diseases.

In [1], interdependencies of medical data are modeled as
links. A straightforward and efficient approach to wrapping
such data in UWSDTs is to follow the links and create one
component for all interrelated values. Additionally, each
different kind of information, such as medications or diseases,
is stored in a separate template relation.